Putting Book Scan PDFs in Scrapbox 2019
2019-10-08
Split the PDF into images, then upload them to Gyazo Pro via a script.
OCR takes time, so the OCR data is fetched after a while.
Images are extracted from ScanSnap scan results with pdfimages. This works fine for PDFs made by cutting up a book and scanning it.
It does not work for PDFs of slides and the like.
Locally, the images are stored in folders named by MD5 hash.
Sync it to AWS.
That's very kindly written.
Deleting files locally does not delete anything on S3.
Syncing to AWS is not really required,
because the file contents are sent to Gyazo directly.
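A minimal sketch of the local layout with folders named by MD5 hash. It assumes the hash is taken over the bytes of the PDF file, which the note does not actually say. The names `md5_of_file` and `folder_for_pdf` are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def folder_for_pdf(pdf_path, root):
    """Create (if needed) and return root/<md5 of the PDF> as the image folder.

    Assumption: one folder per PDF, keyed by the hash of the PDF contents.
    """
    folder = Path(root) / md5_of_file(pdf_path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Keying the folder on the content hash means the same PDF always maps to the same folder, even if the file is renamed.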
Use pdftocairo, since slides cannot be converted to images with pdfimages.
$ pdftocairo -r 200 -f 0 -jpeg <pdf> pages
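The choice between the two tools could be sketched as follows. The `is_slides` flag and the function names are hypothetical; the note does not say how the script decides which tool to use. The sketch leaves out the `-f` option, so pdftocairo converts all pages by default, and the `-j` option for pdfimages (write JPEG images as JPEG files) is my choice, not something from the note.

```python
import subprocess


def extraction_command(pdf_path, out_prefix, is_slides, dpi=200):
    """Build the command line that turns a PDF into page images.

    Scanned books: pdfimages pulls out the embedded scan images as they are.
    Slides: pdftocairo renders each page to a JPEG at the given resolution.
    """
    if is_slides:
        return ["pdftocairo", "-r", str(dpi), "-jpeg", pdf_path, out_prefix]
    return ["pdfimages", "-j", pdf_path, out_prefix]


def run_extraction(pdf_path, out_prefix, is_slides):
    """Run the chosen tool; raises CalledProcessError if it fails."""
    subprocess.run(extraction_command(pdf_path, out_prefix, is_slides), check=True)
```

For example, `run_extraction("talk.pdf", "pages", is_slides=True)` would run the same kind of pdftocairo command as the one above, given that poppler-utils is installed.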
Multiple PDFs are now combined into a single JSON
pdfstojson.rb calls makejson.rb
I looked into how to do it in Python, but in the end I achieved it by running makejson.rb as a child process.
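For reference, the child-process approach might look like this in Python. This is not the actual pdfstojson.rb. It assumes each child command prints one JSON document to stdout and that the documents have the shape `{"pages": [...]}` used for Scrapbox imports. How makejson.rb is really invoked and what it outputs are not stated in the note, so `argv` is left to the caller.

```python
import json
import subprocess


def run_child_json(argv):
    """Run a child process and parse its stdout as JSON.

    Assumption: the child prints exactly one JSON document to stdout.
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def combine_pages(documents):
    """Merge several {"pages": [...]} documents into a single one."""
    pages = []
    for doc in documents:
        pages.extend(doc.get("pages", []))
    return {"pages": pages}
```

With these, combining would be one `run_child_json` call per PDF followed by a single `combine_pages` over the results.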
A while after the JSON is ready, download the OCR results from Gyazo and add them to it.
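A rough sketch of the "add" step only. It assumes the JSON has the `{"pages": [{"title": ..., "lines": [...]}]}` shape, that each image appears as a line of the form `[https://gyazo.com/<id>]`, and that the OCR text has already been downloaded into a dict keyed by image id. Fetching the text from Gyazo is left out, and `add_ocr_lines` is a hypothetical name.

```python
import re

# Matches a line that consists only of a Gyazo image link in bracket notation.
GYAZO_LINK = re.compile(r"^\[https://gyazo\.com/([0-9a-f]+)\]$")


def add_ocr_lines(doc, ocr_by_id):
    """Return a copy of doc with OCR text inserted after each image line.

    ocr_by_id maps a Gyazo image id to its OCR text. Images whose OCR is
    not available yet are left untouched, so the function can be rerun later.
    """
    new_pages = []
    for page in doc["pages"]:
        lines = []
        for line in page["lines"]:
            lines.append(line)
            match = GYAZO_LINK.match(line.strip())
            if match and match.group(1) in ocr_by_id:
                lines.extend(ocr_by_id[match.group(1)].splitlines())
        new_pages.append({**page, "lines": lines})
    return {**doc, "pages": new_pages}
```

Because the input is not modified, the JSON without OCR stays available in case the download has to be retried.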
---
This page is auto-translated from /nishio/書籍スキャンPDFをScrapboxに置く2019 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.